Submitted by:
| # | Name | Id | Email |
|---|------|----|-------|
| Student 1 | Shaked Silverman | 206232753 | shaked.s@campus.technion.ac.il |
| Student 2 | Amit Levi | 207422650 | amitlevi@campus.technion.ac.il |
In this assignment we'll explore deep reinforcement learning. We'll implement two popular and related methods for directly learning the policy of an agent for playing a simple video game.
You can of course use any editor or IDE to work on these files.
import pandas as pd
import seaborn as sns
In this part you'll implement a small comparative-analysis project, heavily based on the materials from the tutorials and homework.
project/ directory. You can import these files here, as we do for the homeworks.

TACO is a growing image dataset of waste in the wild. It contains images of litter taken under diverse environments: woods, roads and beaches.
from IPython.display import Image
Image('imgs/taco.png')
you can read more about the dataset here: https://github.com/pedropro/TACO
and can explore the data distribution and how to load it from here: https://github.com/pedropro/TACO/blob/master/demo.ipynb
The stable version of the dataset, which contains 1500 images and 4787 annotations, is located in datasets/TACO-master.
You do not need to download the dataset.
Good luck!
As the task is object detection, and with inspiration from part 6 of HW2 (YOLOv3), we chose YOLOv8 as our model, which is currently (unless YOLOv9 somehow shows up by the time this sentence is read) the state of the art in object detection. In addition to its state-of-the-art performance, its API is easy to use and the prediction results are neatly saved in a dedicated directory.
YOLOv8 consists of 24 convolutional layers (CNNs) followed by 2 fully connected layers (FCs). Of the 24 layers, 7 are regular CNN layers, with a C2F layer between each pair of them (8 in total), and a single SPP layer sits in the middle.
C2F (Coarse2Fine) layers are a new addition relative to previous YOLO models. Each layer receives the output of a CNN layer, and this output is divided into many equal-sized rectangular mini-images. Each mini-image is then converted to an HSV histogram (i.e., a histogram depicting the color distribution of the image). For each training class, a learnable query image is fed to the C2F layer and likewise converted to an HSV histogram. Each of the mini-images is then compared to the query image by cosine similarity, and the mini-images most similar to the query are picked, along with their locations in the larger input image. This layer helps the model concentrate on specific desired objects within the entire image, by querying for the object of interest and comparing it to the various image parts.
The middle SPP (Spatial Pyramid Pooling) layer has existed since YOLOv3. Rather than regular CNN layers, which require repeatedly computing the convolutional features, a spatial pyramid pooling network (SPP-net) computes all feature maps from the entire image in one go, saving precious time. The SPPF layer used in YOLOv8 is an enhanced version of SPP that uses fewer FLOPs, improving its efficiency.
YOLOv8 uses DFL (Distribution Focal Loss) as its loss function. A focal loss is a dynamically scaled cross-entropy loss that concentrates on difficult samples and automatically down-weights easier examples. A sample whose metric is close to the threshold has a significantly higher effect than a sample far from the threshold. Through its distributional aspect, the loss function can take multiple samples into account at the same time, which is beneficial for object detection, where multiple objects can be present in a single image.
YOLOv8, and all YOLO models in general, use the IoU (Intersection over Union) metric to evaluate performance. As explained in the YOLO tutorial, IoU is calculated by dividing the AoO (Area of Overlap) by the AoU (Area of Union). As the AoO increases, the IoU value increases as well. Because of the division by the AoU, even if an annotation box completely surrounds the desired object, the IoU will be less than 1 whenever the box is larger than the object, forcing the model to learn to shrink the annotation box in order to raise the IoU, and as a result box the desired object accurately and tightly.
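To make the metric concrete, here is a minimal IoU computation for two axis-aligned boxes in (x1, y1, x2, y2) format (a sketch for illustration; YOLOv8 computes this internally):

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Area of Overlap (zero if the boxes do not intersect).
    aoo = max(0, x2 - x1) * max(0, y2 - y1)
    # Area of Union = sum of the two areas minus the overlap.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    aou = area_a + area_b - aoo
    return aoo / aou if aou > 0 else 0.0
```

Note that only a perfectly matching box yields IoU = 1, while an oversized box is penalized through the larger union.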
The main metrics of YOLOv8, and YOLO models in general, are mAP50 and mAP50-95. mAP (Mean Average Precision), as explained in the YOLO tutorial, is calculated as the area under the precision-recall curve of the model's predictions. In contrast to class prediction, recall and precision are determined by the IoU value: a value above 0.5 is counted as a positive detection, and a value below 0.5 is considered negative. However, this threshold of 0.5 is a bit constraining, and may introduce an undesirable bias towards mediocre results. Therefore, in addition to mAP50, we also observe the mAP50-95 metric, which computes precision and recall at various thresholds between 0.5 and 0.95 and returns their average, giving a potentially better representation of the model's robustness.
As YOLOv8 was already pretrained on object detection (though not on the TACO dataset), we wanted to utilize that capability to obtain better results. It was trained on 640x640 inputs, so we wanted to resize the images of the dataset accordingly. Furthermore, we stumbled upon a labeling problem: there are 60 labels in the data instead of the desired 7 labels. In addition, some annotations' coordinates were inaccurate, in the sense that they exceeded the valid coordinate range of the image itself.
To tackle these problems we used RoboFlow, an online tool where images and their annotation files can be uploaded, resized, relabeled, and easily split into training, validation, and test sub-datasets.
With this tool, we resized the images to 640x640, remapped the 60 annotation labels to the desired 7, fixed annotations with out-of-bounds coordinates, and split the data into training, validation, and test sub-datasets.
In order to get better results, we decided to use the provided YOLOv8n model with weights pretrained on general object detection. We first trained it for 100 epochs with the default parameters as a baseline for improvement (Optimizer: SGD, Learning Rate: 0.1, Momentum: 0.937).
Then, we performed a 3-dimensional hyperparameter grid search to find the optimal optimizer, learning rate, and momentum.
We chose the optimizers SGD, Adam, and AdamW, with learning rates 0.01, 0.005, 0.001, 0.0005, 0.0001 and momentums 0.1, 0.5, 0.9, where for Adam and AdamW "momentum" refers to the beta1 hyperparameter.
Each combination of optimizer, learning rate, and momentum was run for 5 epochs, and both the mAP50 and mAP50-95 metrics were recorded for each run.
As we're still not capable of 4-dimensional perception, we separated the results into 2-D tables for easy analysis.
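The grid search described above can be sketched as follows. The `grid_search` helper and the stub evaluator are hypothetical illustrations; the actual runs wrapped the ultralytics `model.train` call shown later in this notebook.

```python
from itertools import product

optimizers = ['SGD', 'Adam', 'AdamW']
learning_rates = [0.01, 0.005, 0.001, 0.0005, 0.0001]
momentums = [0.1, 0.5, 0.9]  # interpreted as beta1 for Adam/AdamW

def grid_search(train_and_eval):
    """Run every (optimizer, lr, momentum) combination for 5 epochs
    and collect the metrics returned for each combination."""
    results = {}
    for opt, lr, mom in product(optimizers, learning_rates, momentums):
        results[(opt, lr, mom)] = train_and_eval(opt, lr, mom, epochs=5)
    return results

# Stub evaluator for illustration; the real one trains YOLOv8n and
# reads back mAP50 / mAP50-95 from the validation results.
results = grid_search(lambda opt, lr, mom, epochs: {'mAP50': 0.0, 'mAP50-95': 0.0})
```

With 3 optimizers, 5 learning rates, and 3 momentum values, this yields 45 short training runs in total.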
df = pd.read_csv('./project/SGD_50.csv', index_col=0)
ax = sns.heatmap(df, annot=True)
r = ax.set(xlabel="Momentum", ylabel="Learning Rate", title="SGD mAP50")
df = pd.read_csv('./project/SGD_50-95.csv', index_col=0)
ax = sns.heatmap(df, annot=True)
r = ax.set(xlabel="Momentum", ylabel="Learning Rate", title="SGD mAP50-95")
df = pd.read_csv('./project/ADAM_50.csv', index_col=0)
ax = sns.heatmap(df, annot=True)
r = ax.set(xlabel="Beta", ylabel="Learning Rate", title="Adam mAP50")
df = pd.read_csv('./project/ADAM_50-95.csv', index_col=0)
ax = sns.heatmap(df, annot=True)
r = ax.set(xlabel="Beta", ylabel="Learning Rate", title="Adam mAP50-95")
df = pd.read_csv('./project/ADAMW_50.csv', index_col=0)
ax = sns.heatmap(df, annot=True)
r = ax.set(xlabel="Beta", ylabel="Learning Rate", title="AdamW mAP50")
df = pd.read_csv('./project/ADAMW_50-95.csv', index_col=0)
ax = sns.heatmap(df, annot=True)
r = ax.set(xlabel="Beta", ylabel="Learning Rate", title="AdamW mAP50-95")
We picked the optimizer AdamW (lr = 0.005, beta1 = 0.5), as its results were the highest of all. We then ran it for 100 epochs.
Below is the code used for training and validating. After training was done, the yaml file was edited so that the test folder replaced the validation folder, effectively using the "val" function to predict all test samples (the actual validation runs automatically right after training, so this is the actual testing step). All result graphs are generated by YOLOv8.
from ultralytics import YOLO

yaml_file_path = '/path/to/yaml/file/data.yaml'
model_path = '/path/to/model/file/yolov8n.pt'

# Set pre-trained model weights:
model = YOLO(model_path)

# Training:
model.train(data=yaml_file_path, epochs=100, imgsz=640, workers=2)

# Validating:
best_model_path = 'path/to/best/model/file/weights/best.pt'
model = YOLO(best_model_path)
metrics = model.val(data=yaml_file_path)
As mentioned before, we first trained the model without altering any parameters, apart from the pretrained weights supplied with it, to obtain baseline results for comparison. We first observe the confusion matrix for the labels:
Image('project/confusion_matrix_no_opt.png',width=1000, height=750)
We barely had any samples labeled 'bio' or 'other' in the dataset, hence it's no surprise the model couldn't learn these categories at all. Not surprisingly, the most prominent label, 'metals_and_plastic', also has the best prediction accuracy, 0.47. It is also the label most often assigned to an object once the model has successfully identified it as not being background (i.e., recognized that an object exists there).
Image('project/PR_curve_no_opt.png',width=675, height=450)
Similarly to the confusion matrix, a nice curve can be seen for the 'metals_and_plastic' label, whereas for all the other labels the results are rather poor, due to their scarcity in the dataset.
Various loss and metric graphs:
Image('project/results_no_opt.png',width=1000, height=500)
Image('project/confusion_matrix_AdamW.png',width=1000, height=750)
Although a slight decrease in accuracy (0.41 versus 0.47) is seen for the label 'metals_and_plastic', other labels improved significantly, for example 'paper' (from 0.2 to 0.37) and 'non_recyclable' (from 0.06 to 0.16). We infer that, given the better optimum reached with AdamW relative to the base model with respect to the DFL loss, the model gave less weight to the easy-to-identify 'metals_and_plastic' samples and in return focused more on difficult ones such as 'paper'.
Image('project/PR_curve_AdamW.png',width=675, height=450)
As expected (we already observed this behaviour in the confusion matrix), the curve for 'metals_and_plastic' degraded relative to the base training. The curve for 'paper' improved significantly, as seen in the 0.4-0.6 precision and recall region. Even the 'glass' label shows a nice improvement in its curve.
Various loss and metric graphs:
Image('project/results_AdamW.png',width=1000, height=500)
As can be seen, in comparison to the base model there is a significant recall improvement and a better DFL loss, and therefore also a better box loss, all leading to better mAP results.
Image('project/results_compare.jpg')
As can be seen, due to the accuracy decrease for the 'metals_and_plastic' label, objects are sometimes wrongly classified by the AdamW model, while the base run, with its higher accuracy for that label, produces more correct classifications. However, thanks to the increased mAP of AdamW, the drinking cup in the example above was successfully identified (and partially classified) by AdamW, whereas the base training could not detect it at all! The same is seen in the picture to the left of the drinking cup, where AdamW manages to identify objects the base training couldn't (although it adds some false-positive, possibly imaginary, objects to the identification), and yet again in the sand image at the bottom.
This section contains summary questions about various topics from the course material.
You can add your answers in new cells below the questions.
Notes
====================================================================== ANSWER:
In Convolutional Neural Networks (CNNs), a receptive field refers to the portion of the input image that a single neuron in a layer is "looking at". Each neuron's receptive field is determined by the size of its convolutional kernel, the number of layers in the network, and the stride with which the kernel moves across the input image. The receptive field grows with each subsequent layer, as each neuron receives input from a larger region of the previous layer.
====================================================================== ANSWER:
There are several ways to control the rate at which the receptive field grows from layer to layer in CNNs. The first approach is to use smaller convolutional kernels (such as 3x3) and increase the number of layers in the network. This approach is called "deepening" and has the advantage of increasing the non-linearity of the network, since each layer introduces a non-linear activation function. By using smaller kernels, the receptive field grows more slowly, but more layers are needed to cover the same region of the input image.
The second approach is to use pooling layers between the convolutional layers. Pooling layers reduce the spatial dimensionality of the input, typically by taking the maximum or average value over a small region (such as 2x2) of the previous layer. This has the effect of increasing the receptive field of each neuron in the next layer, since they are now looking at a larger region of the input. However, pooling layers also reduce the resolution of the input, which can result in loss of information.
The third approach is to use dilated convolutions, also known as atrous convolutions. Dilated convolutions insert gaps between the values in the convolutional kernel, effectively increasing the size of the kernel without increasing the number of parameters. This has the effect of increasing the receptive field of each neuron, while still maintaining a high spatial resolution of the input. However, dilated convolutions can result in a more sparse representation of the input, which may reduce the performance of the network.
In terms of how they combine input features, deepening and dilated convolutions both combine input features in a local, dense manner. Pooling, on the other hand, combines features in a more global, sparse manner by taking the maximum or average value over a larger region of the input.
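The three mechanisms can be compared through the effective kernel size: a kernel with k taps and dilation d covers k + (k - 1)(d - 1) input positions without adding parameters. A quick sketch of that arithmetic:

```python
def effective_kernel(k, d):
    """Spatial extent covered by a k-tap kernel with dilation d."""
    return k + (k - 1) * (d - 1)

# A 3x3 kernel with dilation 2 covers a 5x5 region with only 9 weights,
# the same extent that two stacked 3x3 layers would see.
k_eff_dilated = effective_kernel(3, 2)
```

This makes explicit why dilation grows the receptive field faster per layer than plain 3x3 stacking, at the cost of sampling the input more sparsely.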
====================================================================== ANSWER:
The CNN with three convolutional layers can be defined as follows:
Layer 1: 32 filters with a 3x3 kernel, ReLU activation, and padding
Layer 2: 64 filters with a 3x3 kernel, ReLU activation, and padding
Layer 3: 128 filters with a 3x3 kernel, ReLU activation, and padding
In layer 1, each neuron has a receptive field of 3x3 pixels, meaning it looks at a 3x3 patch of the input. In layer 2, each neuron has a receptive field of 5x5 pixels, since it receives input from a 3x3 patch of the previous layer. Finally, in layer 3, each neuron has a receptive field of 7x7 pixels, since it receives input from a 3x3 patch of the previous layer.
To interpret the performance of the network, one would need to consider the dataset being used, the objective of the task, and the evaluation metrics being used. However, in general, deeper networks with larger receptive fields tend to perform better on tasks that require a high degree of spatial abstraction, such as object recognition or semantic segmentation.
import torch
import torch.nn as nn
cnn = nn.Sequential(
nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(in_channels=4, out_channels=16, kernel_size=5, stride=2, padding=2),
nn.ReLU(),
nn.MaxPool2d(2),
nn.Conv2d(in_channels=16, out_channels=32, kernel_size=7, dilation=2, padding=3),
nn.ReLU(),
)
cnn(torch.rand(size=(1, 3, 1024, 1024), dtype=torch.float32)).shape
torch.Size([1, 32, 122, 122])
What is the size (spatial extent) of the receptive field of each "pixel" in the output tensor?
====================================================================== ANSWER:
In a Convolutional Neural Network (CNN), the spatial extent of the receptive field of each "pixel" in the output tensor depends on the architecture: the kernel sizes, strides, dilations, and pooling operations of the layers. As we move deeper into the network, each neuron's receptive field grows, since each neuron is connected to a region of the previous layer whose size is determined by the (effective) kernel size and the accumulated stride. Tracking the receptive field r and the cumulative stride (the "jump" j) layer by layer with the recurrence r_out = r_in + (k_eff - 1) * j_in, where k_eff = k + (k - 1)(d - 1) accounts for dilation:
After the first convolutional layer (kernel 3, padding 1): spatial dimensions 1024x1024, receptive field 3x3, jump 1.
After the first max pooling layer (kernel 2): spatial dimensions 512x512, receptive field 4x4, jump 2.
After the second convolutional layer (kernel 5, stride 2, padding 2): spatial dimensions 256x256, receptive field 12x12, jump 4.
After the second max pooling layer (kernel 2): spatial dimensions 128x128, receptive field 16x16, jump 8.
After the third convolutional layer (kernel 7, dilation 2, padding 3, effective kernel 13): spatial dimensions 122x122 (matching the output shape above), receptive field 16 + 12*8 = 112, i.e. 112x112.
Therefore each "pixel" in the output tensor has a receptive field of 112x112 input pixels.
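As a sanity check, the layer-by-layer arithmetic can be automated with the standard recurrence r_out = r_in + (k_eff - 1) * j_in, where j is the cumulative stride and k_eff accounts for dilation:

```python
def receptive_field(layers):
    """layers: list of (kernel, stride, dilation) tuples.
    Returns the receptive field of one output pixel."""
    r, j = 1, 1  # receptive field and cumulative stride ("jump")
    for k, s, d in layers:
        k_eff = k + (k - 1) * (d - 1)  # dilation enlarges the kernel extent
        r += (k_eff - 1) * j
        j *= s
    return r

# The cnn above: conv3 -> pool2 -> conv5/stride2 -> pool2 -> conv7/dilation2
layers = [(3, 1, 1), (2, 2, 1), (5, 2, 1), (2, 2, 1), (7, 1, 2)]
rf = receptive_field(layers)
```

Padding affects the output size but not the receptive-field extent, so it is omitted here.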
You have trained a CNN, where each layer $l$ is represented by the mapping $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)$, and $f_l(\cdot;\vec{\theta}_l)$ is a convolutional layer (not including the activation function).
After hearing that residual networks can be made much deeper, you decide to change each layer in your network to use the following residual mapping instead, $\vec{y}_l=f_l(\vec{x};\vec{\theta}_l)+\vec{x}$, and re-train.
However, to your surprise, by visualizing the learned filters $\vec{\theta}_l$ you observe that the original network and the residual network produce completely different filters. Explain the reason for this.
ANSWER:
The reason for the observed differences in learned filters between the original CNN and the residual CNN lies in the way the residual connection changes what each layer has to learn. In the original network, each layer is optimized to produce the desired output directly from its input. In the residual network, the input is added back via the skip connection, so the layer only has to model the residual $\vec{y}_l - \vec{x}$; for example, an identity mapping is obtained with $\vec{\theta}_l \approx 0$ in the residual network, whereas the plain network would need to learn identity-like filters to achieve the same effect. Because the two parametrizations define different optimization problems, gradient descent converges to different filters, and the learned $\vec{\theta}_l$ are not directly comparable or interpretable between the two networks even when the end-to-end functions they compute are similar.
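One way to see this concretely: in the residual form, all-zero filters already implement the identity mapping, while in the plain form they collapse the output to zero. A minimal sketch:

```python
import torch
import torch.nn as nn

# A conv layer with all-zero filters (no bias), i.e. theta_l = 0.
conv = nn.Conv2d(4, 4, kernel_size=3, padding=1, bias=False)
nn.init.zeros_(conv.weight)

x = torch.randn(1, 4, 8, 8)
plain = conv(x)          # plain layer: zero filters give a zero output
residual = conv(x) + x   # residual layer: zero filters give the identity
```

So the "easy" solutions available to each parametrization are different, which is why the learned filters end up looking different.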
import torch.nn as nn
p1, p2 = 0.1, 0.2
nn.Sequential(
nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),
nn.ReLU(),
nn.Dropout(p=p1),
nn.Dropout(p=p2),
)
Sequential(
  (0): Conv2d(3, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): Dropout(p=0.1, inplace=False)
  (3): Dropout(p=0.2, inplace=False)
)
If we want to replace the two consecutive dropout layers with a single one defined as follows:
nn.Dropout(p=q)
what would the value of q need to be? Write an expression for q in terms of p1 and p2.
====================================================================== ANSWER:
In order to replace the two consecutive dropout layers with a single one, we need to find the equivalent drop probability q that would have the same effect as applying p1 and p2 consecutively. This can be computed as follows:
q = 1 - (1 - p1) * (1 - p2)
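This follows because a unit survives both layers only if it is kept by each one independently, with probability (1 - p1)(1 - p2); the equivalent single layer drops it with the complementary probability. With the values above:

```python
p1, p2 = 0.1, 0.2
# A unit survives both dropout layers with probability (1 - p1) * (1 - p2),
# so the single equivalent layer must drop it with the complement:
q = 1 - (1 - p1) * (1 - p2)
```

For p1 = 0.1 and p2 = 0.2 this gives q = 0.28.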
====================================================================== ANSWER:
After applying dropout with a drop-probability of p, the activations are scaled by 1/(1-p) in order to maintain their expected value unchanged. To see why this is the case, consider a single activation a that is either kept with probability 1-p or set to zero with probability p. The expected value of this activation can be computed as follows:
E[a] = (1 - p) * a + p * 0 = (1 - p) * a
If we want to maintain the expected value of a unchanged after applying dropout, we need to scale it by 1/(1-p) to compensate for the reduction in the number of active units. This means that the actual activation a' after dropout will be given by:
a' = a * mask / (1-p)
where mask is a binary mask that determines which units are kept and which are dropped. By multiplying a by mask, we set the dropped units to zero and keep the active units unchanged, while the scaling factor of 1/(1-p) ensures that the expected value of a' is equal to the expected value of a before dropout.
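The expectation-preserving property can be checked by enumerating the two outcomes of the Bernoulli mask. This is a sketch of inverted dropout, the variant PyTorch applies at training time:

```python
def inverted_dropout(a, mask, p):
    """Apply a fixed binary mask and rescale survivors by 1/(1-p)."""
    return a * mask / (1 - p)

a, p = 3.0, 0.4
# Expected output = P(keep) * scaled value + P(drop) * 0
expected = (1 - p) * inverted_dropout(a, 1, p) + p * inverted_dropout(a, 0, p)
```

The expected output equals the original activation a, so no rescaling is needed at test time.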
====================================================================== ANSWER:
No, an L2 loss is not appropriate for training a binary classifier like the one described here. The L2 loss, also known as mean squared error (MSE), is a regression loss that measures the average squared difference between the predicted and target values, and is meant for problems where the output is a continuous variable, such as predicting a numeric quantity. In binary classification, however, the model's output represents a probability distribution over two classes (here, dog and hotdog) rather than a continuous value, so a regression loss like L2 is not suitable.
In binary classification, a common loss function to use is the binary cross-entropy loss, also known as log loss. This loss function is designed to measure the difference between two probability distributions, in this case, the predicted probability distribution and the true probability distribution. The binary cross-entropy loss works by taking the negative log likelihood of the predicted probability of the correct class.
Here's an example to illustrate why L2 loss is not appropriate for binary classification. Let's say we have a model that outputs a probability distribution over the classes, and we want to classify an image as either a dog (output 0) or a hotdog (output 1). The true label for an image is hotdog, so the true probability distribution is [0, 1].
If we train the model with L2 loss, and the model outputs [0.5, 0.5], the L2 loss would be (0.5-0)^2 + (0.5-1)^2 = 0.5. However, this does not reflect the fact that the model is uncertain and doesn't strongly predict either class. In contrast, the binary cross-entropy loss would penalize the model for being uncertain and not strongly predicting the true class.
Instead, we can use a binary cross-entropy loss, which is a commonly used loss function for binary classification problems. The binary cross-entropy loss measures the difference between the predicted probability and the target probability for a binary classification problem. It is defined as:
L = -[y * log(p) + (1 - y) * log(1 - p)]
where y is the ground-truth label (0 for dog and 1 for hotdog), p is the predicted probability of the positive class (hotdog), and log is the natural logarithm.
To illustrate the difference between L2 and binary cross-entropy losses, consider a simple example where we have two training examples and their corresponding true labels and model predictions:
| Example | True Label | Model Prediction |
|---------|------------|------------------|
| 1       | 0          | 0.8              |
| 2       | 1          | 0.2              |
If we use an L2 loss to train the model, the loss would be computed as the squared error between the true labels and the predicted values:

L2 loss = (0 - 0.8)^2 + (1 - 0.2)^2 = 0.64 + 0.64 = 1.28

Here the model is confidently wrong on both examples, yet the L2 penalty per example is bounded by 1, no matter how confident the mistake. Using an L2 loss could therefore under-penalize confident errors, which could result in suboptimal performance.
On the other hand, if we use a binary cross-entropy loss to train the model, the loss would be computed as follows:

Binary cross-entropy loss = -[0 * log(0.8) + 1 * log(1 - 0.8)] - [1 * log(0.2) + 0 * log(1 - 0.2)] = -2 * log(0.2) ≈ 3.22

This loss penalizes confident wrong predictions much more harshly (it grows without bound as the predicted probability of the true class approaches zero) and rewards confident correct predictions, which makes it a more suitable loss function for binary classification problems.
In summary, L2 loss is not appropriate for binary classification because the model's output is a probability distribution over the classes while L2 is designed for continuous targets. Binary cross-entropy, which measures the difference between the predicted and true distributions and heavily penalizes the model for confidently predicting the wrong class, is the appropriate loss for training the binary classifier described in the problem statement.
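The two losses on the example above can be computed directly; note how BCE grows without bound as a confident prediction moves toward the wrong class, while the per-example L2 penalty is bounded by 1:

```python
import math

def l2_loss(y, p):
    """Squared error between the label and the predicted probability."""
    return (y - p) ** 2

def bce_loss(y, p):
    """Binary cross-entropy for label y and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Example 1: true label 0, prediction 0.8; Example 2: true label 1, prediction 0.2
l2_total = l2_loss(0, 0.8) + l2_loss(1, 0.2)
bce_total = bce_loss(0, 0.8) + bce_loss(1, 0.2)
```

Pushing the wrong prediction further (e.g. p = 0.99 for a true label of 0) barely changes the L2 term but blows up the BCE term, which is exactly the gradient signal a classifier needs.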
Image('https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/PiratesVsTemp%28en%29.svg/1200px-PiratesVsTemp%28en%29.svg.png')
You decide to train a cutting-edge deep neural network regression model, that will predict the global temperature based on the population of pirates in N locations around the globe.
You define your model as follows:
import torch.nn as nn
N = 42 # number of known global pirate hot spots
H = 128
mlpirate = nn.Sequential(
nn.Linear(in_features=N, out_features=H),
nn.Sigmoid(),
*[
nn.Linear(in_features=H, out_features=H), nn.Sigmoid(),
]*24,
nn.Linear(in_features=H, out_features=1),
)
While training your model you notice that the loss reaches a plateau after only a few iterations. It seems that your model is no longer training. What is the most likely cause?
====================================================================== ANSWER:
The most likely cause for the plateau in loss after only a few iterations is the vanishing gradient problem. This problem arises when gradients in the backpropagation algorithm become too small to effectively update the weights in the earlier layers of the network. As a result, the weights in these layers remain largely unchanged, leading to a stagnant or plateauing training process.
In the given model, the repeated use of the Sigmoid activation function may be causing the vanishing gradient problem. The Sigmoid function has a maximum gradient of 0.25, which means that as backpropagation proceeds through the layers of the network, the gradients can become exponentially small. This makes it difficult to update the weights in the earlier layers of the network, and can lead to a plateau in training.
To address this issue, one potential solution is to use an activation function with a larger maximum gradient, such as the Rectified Linear Unit (ReLU). Another solution could be to use normalization techniques, such as Batch Normalization or Layer Normalization, which help stabilize the gradient flow through the network.
In addition, the architecture of the given model may be too deep, with 25 hidden layers. Deep neural networks are more prone to the vanishing gradient problem, especially when using certain activation functions. In this case, reducing the number of layers or using skip connections (such as in a ResNet architecture) could help alleviate the issue.
Overall, the vanishing gradient problem is a well-known challenge in deep learning, and addressing it requires careful consideration of the model architecture and the activation functions used.
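The 0.25 bound on the sigmoid derivative can be verified directly; multiplying 25 such factors together shows why the early layers of mlpirate stop learning:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1 - s)  # maximized at x = 0, where it equals 0.25

# Worst-case (best-case!) gradient attenuation through 25 sigmoid layers:
attenuation = sigmoid_prime(0.0) ** 25
```

Even in the most favorable regime (all pre-activations at zero), the gradient reaching the first layer is scaled by at most 0.25^25, on the order of 1e-15; for saturated units it is far smaller.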
A friend suggests that if you replace the sigmoid activations with tanh, it will solve your problem. Is he correct? Explain why or why not.
====================================================================== ANSWER:
Replacing the sigmoid activations with tanh may or may not solve the problem of the loss plateauing during training. The tanh activation function is similar to the sigmoid in that it is also sigmoidal in shape, but it is centered around zero and ranges from -1 to 1, instead of 0 to 1. The advantage of tanh over sigmoid is that it can output negative values, which can be useful in some situations.
However, in this case, changing the activation function is unlikely to fully resolve the plateau. The mlpirate model has a large number of layers (25 hidden activations), and the repeated use of a saturating activation can lead to saturation of the gradients. This saturation can cause the gradients to vanish or explode, making it difficult for the optimization algorithm to update the earlier weights effectively. While tanh has a larger maximum derivative than sigmoid (1 at zero, versus 0.25), it still saturates for large-magnitude inputs, so replacing the activation function alone may not be sufficient to overcome this problem.
====================================================================== ANSWER:
True or false: In a model using exclusively ReLU activations, there can be no vanishing gradients; The gradient of ReLU is linear with its input when the input is positive; ReLU can cause "dead" neurons, i.e. activations that remain at a constant value of zero.
a) False. While ReLU activations are known to alleviate the problem of vanishing gradients, they can still occur in deep networks that use exclusively ReLU activations. When the input to a ReLU activation is negative, the gradient is zero, which can cause the gradients to vanish during backpropagation.
b) True. When the input to a ReLU activation is positive, the gradient is equal to 1, which means that the gradient is linear with respect to its input.
c) True. ReLU can cause "dead" neurons, which are neurons that always output zero, regardless of the input. This can happen if the bias term is set such that the weighted input is always negative. In this case, the gradient of the neuron is always zero, and the neuron remains inactive. Dead neurons can significantly reduce the capacity of a neural network and are often a problem in deep networks. One way to address this issue is to use variants of ReLU, such as leaky ReLU or ELU, which allow a small non-zero gradient for negative inputs so that such neurons can recover.
Answer:
Answer: We would expect the number of iterations needed to converge to $l_0$ to decrease with the new mini-batch size, as the larger batch gives a more precise estimate of the gradient, which leads to a more informative update direction for the model parameters. Updating the model based on a larger number of training examples reduces the noise in the gradient updates, which further improves convergence. Additionally, with more memory available, more of the computation can be done in parallel, further reducing the wall-clock time it takes to train the model.
In tutorial 5 we saw an example of bi-level optimization in the context of deep learning, by embedding an optimization problem as a layer in the network. True or false: In order to train such a network, the inner optimization problem must be solved with a descent based method (such as SGD, LBFGS, etc). Provide a mathematical justification for your answer.
Answer: False. Training such a network only requires that the gradient of the inner problem's solution w.r.t. the network parameters can be computed; it does not require that the solution was obtained by a descent-based method. If the embedded layer solves $\vec{z}^*(\theta)=\arg\min_{\vec{z}} g(\vec{z},\theta)$, then at the optimum the first-order condition $\nabla_{\vec{z}} g(\vec{z}^*,\theta)=\vec{0}$ holds, and differentiating it via the implicit function theorem gives $$ \frac{\partial \vec{z}^*}{\partial \theta} = -\left(\nabla^2_{\vec{z}} g(\vec{z}^*,\theta)\right)^{-1} \nabla_{\theta}\nabla_{\vec{z}} g(\vec{z}^*,\theta). $$ This expression depends only on $g$ at the solution point, not on how $\vec{z}^*$ was obtained, so any solver (e.g. a closed-form or combinatorial one) that returns the optimum suffices.
You wish to train the following 2-layer MLP for a binary classification task: $$ \hat{y}^{(i)} =\mat{W}_2~ \varphi(\mat{W}_1 \vec{x}^{(i)}+ \vec{b}_1) + \vec{b}_2 $$ You wish to minimize the in-sample loss function, defined as $$ L_{\mathcal{S}} = \frac{1}{N}\sum_{i=1}^{N}\ell(y^{(i)},\hat{y}^{(i)}) + \frac{\lambda}{2}\left(\norm{\mat{W}_1}_F^2 + \norm{\mat{W}_2}_F^2 \right) $$ where the pointwise loss is binary cross-entropy: $$ \ell(y, \hat{y}) = - y \log(\hat{y}) - (1-y) \log(1-\hat{y}) $$
Write an analytic expression for the derivative of the final loss $L_{\mathcal{S}}$ w.r.t. each of the following tensors: $\mat{W}_1$, $\mat{W}_2$, $\mat{b}_1$, $\mat{b}_2$, $\mat{x}$.
Answer:
Writing $\vec{h}^{(i)} = \varphi(\mat{W}_1 \vec{x}^{(i)}+\vec{b}_1)$ for the hidden activations and $\delta^{(i)} = \hat{y}^{(i)}-y^{(i)}$ for the error term (the standard binary cross-entropy result, assuming the output is squashed through a sigmoid), the derivative of the final loss with respect to each tensor is: $$ \frac{\partial L_{\mathcal{S}}}{\partial \mat{W}_2}=\frac{1}{N}\sum_{i=1}^{N} \delta^{(i)}\left(\vec{h}^{(i)}\right)^\top + \lambda \mat{W}_2 $$
$$ \frac{\partial L_{\mathcal{S}}}{\partial \vec{b}_2}=\frac{1}{N}\sum_{i=1}^{N} \delta^{(i)} $$
$$ \frac{\partial L_{\mathcal{S}}}{\partial \mat{W}_1}=\frac{1}{N}\sum_{i=1}^{N} \left[\left(\mat{W}_2^\top \delta^{(i)}\right)\odot \varphi'(\mat{W}_1 \vec{x}^{(i)}+\vec{b}_1)\right] \left(\vec{x}^{(i)}\right)^\top +\lambda \mat{W}_1 $$
$$ \frac{\partial L_{\mathcal{S}}}{\partial \vec{b}_1}= \frac{1}{N}\sum_{i=1}^{N}\left(\mat{W}_2^\top \delta^{(i)}\right)\odot\varphi'(\mat{W}_1 \vec{x}^{(i)}+\vec{b}_1) $$
$$ \frac{\partial L_{\mathcal{S}}}{\partial \vec{x}^{(i)}}=\frac{1}{N}\mat{W}_1^\top\left[\left(\mat{W}_2^\top \delta^{(i)}\right)\odot\varphi'(\mat{W}_1\vec{x}^{(i)}+\vec{b}_1)\right] $$
The derivative of a function $f(\vec{x})$ at a point $\vec{x}_0$ is $$ f'(\vec{x}_0)=\lim_{\Delta\vec{x}\to 0} \frac{f(\vec{x}_0+\Delta\vec{x})-f(\vec{x}_0)}{\Delta\vec{x}} $$
Explain how this formula can be used in order to compute gradients of neural network parameters numerically, without automatic differentiation (AD).
What are the drawbacks of this approach? List at least two drawbacks compared to AD.
Answer:
This formula can be used to compute gradients of neural network parameters numerically by evaluating the above quotient with a small perturbation $\Delta\vec{x}$ around each parameter value. Specifically, the gradient of a scalar function $f$ with respect to a parameter $\theta$ can be approximated numerically using: $$ \frac{df(\theta)}{d\theta} \approx \frac{f(\theta+\epsilon)-f(\theta-\epsilon)}{2\epsilon}, $$ where $\epsilon$ is a small scalar representing the perturbation size. This approximation can be computed for each parameter of a neural network, to obtain its numerical gradient.
The main drawbacks of this approach compared to AD are:
1. **Computational cost:** it requires two forward passes per parameter, i.e. $O(P)$ network evaluations for $P$ parameters, whereas AD computes all gradients in a single backward pass. For modern networks with millions of parameters this is prohibitive.
2. **Numerical accuracy:** the result is only an approximation, and it is sensitive to the choice of $\epsilon$: too large an $\epsilon$ incurs truncation error, while too small an $\epsilon$ incurs floating-point round-off error. AD, in contrast, is exact up to machine precision.
Compute the gradient of the loss w.r.t. W and b using the approach of numerical gradients from the previous question. Verify with torch.allclose() that your numerical gradient is close to autograd's gradient.

import copy
N, d = 100, 5
dtype = torch.float64
X = torch.rand(N, d, dtype=dtype)
W, b = torch.rand(d, d, requires_grad=True, dtype=dtype), torch.rand(d, requires_grad=True, dtype=dtype)
def foo(W, b):
return torch.mean(X @ W + b)
loss = foo(W, b)
print(f"{loss=}")
eps = 1e-6
grad_W = torch.zeros_like(W)
grad_b = torch.zeros_like(b)
# Note: modifying a leaf tensor with requires_grad=True in-place raises an
# error, so we perturb detached copies inside a no_grad() block.
with torch.no_grad():
    # central-difference gradient w.r.t. each entry of W
    for i in range(d):
        for j in range(d):
            W_plus = W.detach().clone()
            W_plus[i, j] += eps
            W_minus = W.detach().clone()
            W_minus[i, j] -= eps
            grad_W[i, j] = (foo(W_plus, b) - foo(W_minus, b)) / (2 * eps)
    # central-difference gradient w.r.t. each entry of b
    for i in range(d):
        b_plus = b.detach().clone()
        b_plus[i] += eps
        b_minus = b.detach().clone()
        b_minus[i] -= eps
        grad_b[i] = (foo(W, b_plus) - foo(W, b_minus)) / (2 * eps)
loss.backward()
autograd_W = W.grad
autograd_b = b.grad
assert torch.allclose(grad_W, autograd_W)
assert torch.allclose(grad_b, autograd_b)
loss=tensor(1.9567, dtype=torch.float64, grad_fn=<MeanBackward0>)
What does the output tensor Y contain? Why this output shape? Implement nn.Embedding yourself using only torch tensors.

Answer:
Y is a tensor whose shape is the shape of X with the embedding dimension appended: nn.Embedding replaces each integer index in X with its corresponding embedding vector of length embedding_dim. For the input of shape (5, 6, 7, 8) and embedding_dim=42000, the output shape is therefore (5, 6, 7, 8, 42000).

To implement nn.Embedding using only torch tensors, we can create a new tensor embedding_weight with shape (num_embeddings, embedding_dim) to serve as the trainable parameters for the embedding layer. We can then index into this tensor to obtain the embeddings for a given input sequence. The implementation of nn.Embedding can be written as follows:

import torch
class Embedding(torch.nn.Module):
def __init__(self, num_embeddings, embedding_dim):
super().__init__()
# Initialize the embedding weights
self.embedding_weight = torch.nn.Parameter(torch.randn(num_embeddings, embedding_dim))
def forward(self, input_tensor):
# Index into the embedding weight tensor to get the embeddings
embeddings = self.embedding_weight[input_tensor]
return embeddings
In this implementation, num_embeddings is the size of the vocabulary, and embedding_dim is the number of embedding dimensions. The forward method takes as input an integer tensor input_tensor of shape (batch_size, seq_length) containing sequences of word indices, and returns a float tensor of shape (batch_size, seq_length, embedding_dim) containing the corresponding word embeddings.
import torch.nn as nn
X = torch.randint(low=0, high=42, size=(5, 6, 7, 8))
embedding = nn.Embedding(num_embeddings=42, embedding_dim=42000)
Y = embedding(X)
print(f"{Y.shape=}")
Y.shape=torch.Size([5, 6, 7, 8, 42000])
Answer:
True. Truncated Backpropagation Through Time (TBPTT) is a modification of the backpropagation algorithm used for training recurrent neural networks (RNNs) on sequence data. It works by breaking the sequence into smaller sub-sequences and performing backpropagation on each sub-sequence separately. The gradients computed within each sub-sequence are used to update the model parameters, while the hidden state (but not the gradient graph) is carried across sub-sequence boundaries.
False. Implementing TBPTT involves not only limiting the length of the sequence provided to the model but also storing the hidden states at each time step for computing the gradients during backpropagation. During training, the forward pass is performed by running the model on a sequence of length S, and during the backward pass, the gradients are computed for each time-step in the sequence, and then truncated to be used only over the past S time-steps. The hidden states that were stored during the forward pass are then used to compute the gradients for the truncated backpropagation.
True. TBPTT allows the model to learn relations between inputs that are at most S timesteps apart. During training, the gradients only flow backward S timesteps, which means that the model can only learn dependencies between timesteps that are within this range. This is a limitation of TBPTT but can be addressed by choosing an appropriate sequence length that balances the ability of the model to capture long-term dependencies, and the computational cost of training the model.
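The points above can be made concrete with a minimal TBPTT sketch (assuming an nn.GRU with illustrative sizes; the essential step is detaching the hidden state between truncation windows of length S, so gradients flow back at most S timesteps):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, S, B, D, H = 40, 10, 4, 3, 8   # full length, truncation length, batch, input dim, hidden dim
rnn = nn.GRU(input_size=D, hidden_size=H, batch_first=True)
head = nn.Linear(H, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(B, T, D)
target = torch.randn(B, T, 1)

h = None
for t0 in range(0, T, S):
    # forward over one truncation window of length S
    out, h = rnn(x[:, t0:t0 + S], h)
    loss = nn.functional.mse_loss(head(out), target[:, t0:t0 + S])
    opt.zero_grad()
    loss.backward()       # gradients flow back at most S timesteps
    opt.step()
    h = h.detach()        # cut the graph: no gradient beyond this window
```

The forward pass still carries the hidden state across windows (so information older than S steps can influence predictions), but `h.detach()` prevents gradients from crossing the window boundary, which is exactly the S-step learning limitation discussed above.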
Explain qualitatively what the addition of the attention mechanism between the encoder and decoder does to the hidden states that the encoder and decoder each learn to generate (for their language). How are these hidden states different from the model without attention?
After learning that self-attention is gaining popularity thanks to the shiny new transformer models, you decide to change the model from the tutorial: instead of the queries being equal to the decoder hidden states, you use self-attention, so that the keys, queries and values are all equal to the encoder's hidden states (with learned projections). What influence do you expect this will have on the learned hidden states?
Answer:
In machine translation, the addition of attention mechanism between the encoder and decoder provides the decoder with a mechanism to selectively focus on specific parts of the encoded source sequence, conditioned on the target sequence being generated. This is done by computing attention scores between each target hidden state and each of the source hidden states, producing a weighted sum (or attention weighted average) of the source hidden states based on these scores to generate an alignment context vector, which in turn is used to adjust the decoder hidden state. This allows the model to pay more attention to relevant parts of the source sequence at different points in the decoding process, which can significantly improve the quality of the generated translations, particularly for long sequences. The hidden states with attention are different from the model without attention in that they explicitly incorporate the context of the source sequence, which allows the model to generate target sequence tokens based on a weighted combination of the input sequence at each step instead of a summary from the entire input sequence encoded in the final hidden state.
If we change the queries, keys, and values to self-attention in the decoder, it would enable the decoder to attend over its own previously generated sequence while computing the context vector. This means that, at each decoding step, the decoder can focus on different parts of the sequence generated so far to produce the next sequence element. This would allow the decoder to learn more abstract and generalized representations of the source sequence during decoding, and capture finer-grained dependencies between the target sequence and the input source sequence. However, it could also create a risk of attending to inconsistent parts of the decoder output, leading to instability in the decoding process. Therefore, to mitigate this, the self-attention mechanism is typically employed in addition to the source-derived attention mechanism in most sequence-to-sequence models, such as the Transformer architecture. Finally, by using self-attention, we could expect hidden states that are more naturally adaptive to the sequence data, given that the weight applied to each hidden state is computed based on its relationship to other hidden states in the temporal context.
If we decide to use self-attention with the keys, queries, and values equal to the encoder's hidden states, it would provide a mechanism for the decoder to focus on different parts of the input sequence depending on the target sequence being generated. In this case, the decoder would attend to different parts of the encoded input sequence based on how they relate to each other, as opposed to using deterministic feature-wise projections. This would allow the decoder to learn more generalized patterns and structure from the input sequence as opposed to relying on a summary of the entire sequence.
Furthermore, using self-attention would provide more contextual information about the input sequence during encoding and decoding, allowing the model to capture longer distance dependencies in the input sequence as the model recursively attends to itself through the self-attention mechanism. This would allow for a better capture of the relationships between different elements in the input sequence, potentially leading to better translation quality.
However, using self-attention can sometimes lead to a reduction in model interpretability when compared to attention mechanisms that use pre-specified queries for decoding. This is because the self-attention mechanism generates queries based on the previously generated outputs instead of explicitly specified tokens, making it harder to interpret the internal workings of the model. Another disadvantage is that self-attention can be more computationally intensive than other attention mechanisms, such as content-based attention.
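The mechanism described above can be sketched as scaled dot-product self-attention over the encoder's hidden states (a minimal illustration with randomly initialized projections `Wq`, `Wk`, `Wv` and illustrative sizes, not the tutorial's actual model):

```python
import torch

def self_attention(H_enc, Wq, Wk, Wv):
    """Self-attention: queries, keys and values are all learned
    projections of the same sequence of encoder hidden states."""
    Q, K, V = H_enc @ Wq, H_enc @ Wk, H_enc @ Wv
    scores = Q @ K.transpose(-2, -1) / (K.shape[-1] ** 0.5)
    weights = torch.softmax(scores, dim=-1)   # each position attends to all others
    return weights @ V, weights

torch.manual_seed(0)
T, d = 6, 16                        # sequence length, hidden size
H_enc = torch.randn(T, d)           # stand-in for encoder hidden states
Wq, Wk, Wv = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
ctx, weights = self_attention(H_enc, Wq, Wk, Wv)
print(ctx.shape)                    # each output mixes all T positions
```

Each row of `weights` is a distribution over all T encoder positions, which is exactly how every hidden state comes to incorporate context from the whole sequence.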
As we have seen, a variational autoencoder's loss is comprised of a reconstruction term and a KL-divergence term. While training your VAE, you accidentally forgot to include the KL-divergence term. What would be the qualitative effect of this on:
Images reconstructed by the model during training ($x\to z \to x'$)?
Answer:
If the KL-divergence term is not included in the loss function during training of a VAE, it would prevent the model from effectively learning a useful latent representation of the input data.
Without the KL-divergence term in the loss function during the training phase, the VAE would essentially become a standard autoencoder: it would learn a deterministic mapping ($x \to z$) to the latent space, without any probabilistic interpretation. While it may be able to reconstruct the input during training, this latent space may not have a high level of variance when samples are drawn from it, and there might not be any significant benefit of using the VAE architecture over a conventional autoencoder.
Similarly, if the KL-divergence term is not included during generation ($z \to x'$), it would result in the model being only able to generate deterministic outputs that are very similar to the inputs present in the training set. The model would be unable to generate diverse samples from the latent distribution, making it less useful for generative applications like sample generation or data augmentation. The presence of the KL-divergence term in the loss function controls the regularization of the probabilistic latent distribution and makes it possible to sample from the distribution for novel output image generation. If the KL-divergence term is not present, generation would be limited to deterministic outputs generated from single points in the latent space rather than a probabilistic output from a distribution of possible values.
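The two loss terms can be sketched as follows (a minimal illustration assuming an MSE reconstruction term and a diagonal-Gaussian encoder; setting `beta=0` reproduces the "forgotten KL" case discussed above):

```python
import torch

def vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Reconstruction term + beta * KL(q(z|x) || N(0, I)).
    beta=0 removes the KL term, i.e. a plain autoencoder objective."""
    recon = torch.nn.functional.mse_loss(x_recon, x, reduction="sum")
    # closed-form KL for a diagonal Gaussian vs. the standard normal prior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

torch.manual_seed(0)
x = torch.randn(4, 10)
x_recon = x + 0.1 * torch.randn(4, 10)    # stand-in decoder output
mu, logvar = torch.randn(4, 2), torch.randn(4, 2)
print(vae_loss(x, x_recon, mu, logvar))            # full VAE loss
print(vae_loss(x, x_recon, mu, logvar, beta=0.0))  # no KL: nothing pulls q(z|x) toward the prior
```

Since the KL term is the only part of the loss that regularizes $q(z|x)$ toward $\mathcal{N}(\vec{0},\vec{I})$, dropping it leaves the latent distribution free to drift away from the prior we sample from at generation time.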
Answer:
False. While it is true that in principle, the latent space distribution is a normal distribution, it is not necessarily $\mathcal{N}(\vec{0},\vec{I})$. In fact, during training, the VAE model tries to learn a better distribution that can represent the training data more effectively. Therefore, the latent-space distribution after training will depend on the specific data distribution of the input dataset.
False. The reconstruction obtained from multiple encodings of the same image may not be the same due to the stochasticity of the decoder's output. During training, the VAE is designed such that the noise is introduced into the latent space so that it can generate a diversity of outputs with high fidelity. Therefore, the output may vary between different runs of the decoder, even for the same encoded input.
True. The exact VAE objective, the data log-likelihood $\log p(x)$, is intractable to compute directly. Therefore, during optimization, we instead maximize the evidence lower bound (ELBO), which is equivalent to minimizing an upper bound on the negative log-likelihood. The hope is that this bound is tight, meaning that it is close to the real loss value. By minimizing this bound, we can effectively train the VAE model.
Answer:
False. When training the discriminator, gradients do not need to flow back into the generator: the generated samples can be detached from the computation graph, since only the discriminator's weights are updated in that step. It is when training the generator that we backpropagate through the discriminator, while keeping the discriminator's weights frozen.
True. In GANs, the generator learns to generate images from a noise vector sampled from a prior distribution. Typically, this prior is chosen as $\mathcal{N}(\vec{0},\vec{I})$.
True. Initializing the discriminator weights to arbitrary values can cause the discriminator to output random scores at the start of training, which can cause the generator to update its weights to generate low-quality images. Pre-training the discriminator for a few epochs helps it to settle into a reasonable solution, which can stabilize the training process and lead to better generator performance.
False. If the discriminator reaches a stable state with 50% accuracy, it implies that it's no longer able to distinguish between real and generated images, and the generator has reached its performance limit. In such a scenario, further training of the generator won't improve the image quality, and it would be better to stop the training process.
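To make the gradient flow in the alternating updates concrete, a minimal sketch of one GAN step (illustrative toy networks and sizes, not a real training setup): the fake batch is detached for the discriminator step, while the generator step backpropagates through the discriminator without updating its weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_z, d_x = 4, 8
G = nn.Sequential(nn.Linear(d_z, 16), nn.ReLU(), nn.Linear(16, d_x))
D = nn.Sequential(nn.Linear(d_x, 16), nn.ReLU(), nn.Linear(16, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, d_x)          # stand-in for a batch of real data
z = torch.randn(32, d_z)             # prior: N(0, I)
fake = G(z)

# Discriminator step: fake samples are detached, so no gradients
# flow into the generator here; only D's weights are updated.
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: gradients DO flow through D into G, but D's weights
# stay fixed because only opt_g.step() is called.
g_loss = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The `detach()` in the discriminator step and the choice of which optimizer's `step()` is called are what separate the two roles, even though both losses are computed through the same discriminator network.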